Project Description

In this project, we analyze data on the biggest event in American football: the Super Bowl. Our goal is to explore three aspects of the event: game outcomes, TV viewership, and halftime performances.

We will apply data manipulation and visualization techniques to work with real-world data. This will help us uncover interesting patterns and trends related to game results, television audiences, advertising costs, and halftime shows.

The data was collected from Wikipedia and cleaned. It consists of three CSV files covering the 52 Super Bowls played through 2018. By analyzing these datasets, we aim to understand how the games relate to viewership, how the teams performed, and what the halftime shows contributed as entertainment.


The Data

We are provided with three datasets. Below is a summary of each:

1. halftime_musicians.csv

This dataset contains information about the artists who performed at the halftime shows during various Super Bowl games.

Column       Description
super_bowl   The Super Bowl number (for example, 52 stands for Super Bowl LII).
musician     Name of the musician or music group that performed during the halftime show.
num_songs    Number of songs performed during the halftime show.

2. super_bowls.csv

This dataset includes detailed information about each Super Bowl game, such as:

  • The date and location of the game
  • The teams that played
  • The final scores
  • The difference in points between the winning and losing team (difference_pts)

3. tv.csv

This dataset provides television-related information for each Super Bowl, including:

  • Viewership numbers
  • Household ratings
  • Cost of advertisements

We will now begin our analysis to discover what makes the Super Bowl such a major event from both a sports and media perspective.

In [1]:
# Import required libs
import pandas as pd
from matplotlib import pyplot as plt

# Load the CSV data
tv = pd.read_csv("tv.csv")

# Display the data
tv.head()
Out[1]:
super_bowl network avg_us_viewers total_us_viewers rating_household share_household rating_18_49 share_18_49 ad_cost
0 52 NBC 103390000 NaN 43.1 68 33.4 78.0 5000000
1 51 Fox 111319000 172000000.0 45.3 73 37.1 79.0 5000000
2 50 CBS 111864000 167000000.0 46.6 72 37.7 79.0 5000000
3 49 NBC 114442000 168000000.0 47.5 71 39.1 79.0 4500000
4 48 Fox 112191000 167000000.0 46.7 69 39.3 77.0 4000000

Has TV viewership increased over time?

In [2]:
# Plot average US viewership against Super Bowl number
plt.plot(tv.super_bowl, tv.avg_us_viewers)
plt.title('Average Number of US Viewers')
Out[2]:
Text(0.5, 1.0, 'Average Number of US Viewers')
[Figure: line plot titled 'Average Number of US Viewers', with Super Bowl number on the x-axis and average US viewers on the y-axis]
In [3]:
# Conclusion read off the line plot above
viewership_increased = True
print("Super Bowl viewership increased over time.")
Super Bowl viewership increased over time.
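The conclusion above is set by hand after looking at the plot. As a rough cross-check, the trend could also be computed. The sketch below is one possible approach, not the notebook's method: the helper name viewership_trend_up is hypothetical, and the stand-in data contains only the five rows shown in Out[1]. It compares the mean audience of the earliest half of the games with that of the latest half.

```python
import pandas as pd


def viewership_trend_up(tv: pd.DataFrame) -> bool:
    """Hypothetical helper: True if the latest half of the games drew a
    larger mean audience than the earliest half."""
    ordered = tv.sort_values("super_bowl")
    half = len(ordered) // 2
    early = ordered["avg_us_viewers"].iloc[:half].mean()
    late = ordered["avg_us_viewers"].iloc[-half:].mean()
    return bool(late > early)


# Stand-in data: only the five rows shown in Out[1], not the full tv.csv
sample = pd.DataFrame({
    "super_bowl": [52, 51, 50, 49, 48],
    "avg_us_viewers": [103390000, 111319000, 111864000, 114442000, 112191000],
})

recent_trend = viewership_trend_up(sample)
print(recent_trend)
```

On these five most recent games alone the check returns False, because Super Bowl 52 drew fewer viewers than Super Bowls 48 and 49. The notebook's conclusion refers to the full series, so the helper would need to be run on the complete tv DataFrame to test it.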

How many matches finished with a point difference greater than 40?

In [5]:
# Load the data
super_bowls = pd.read_csv("super_bowls.csv")

# Display the Super Bowls data
super_bowls.head()
Out[5]:
date super_bowl venue city state attendance team_winner winning_pts qb_winner_1 qb_winner_2 coach_winner team_loser losing_pts qb_loser_1 qb_loser_2 coach_loser combined_pts difference_pts
0 2018-02-04 52 U.S. Bank Stadium Minneapolis Minnesota 67612 Philadelphia Eagles 41 Nick Foles NaN Doug Pederson New England Patriots 33 Tom Brady NaN Bill Belichick 74 8
1 2017-02-05 51 NRG Stadium Houston Texas 70807 New England Patriots 34 Tom Brady NaN Bill Belichick Atlanta Falcons 28 Matt Ryan NaN Dan Quinn 62 6
2 2016-02-07 50 Levi's Stadium Santa Clara California 71088 Denver Broncos 24 Peyton Manning NaN Gary Kubiak Carolina Panthers 10 Cam Newton NaN Ron Rivera 34 14
3 2015-02-01 49 University of Phoenix Stadium Glendale Arizona 70288 New England Patriots 28 Tom Brady NaN Bill Belichick Seattle Seahawks 24 Russell Wilson NaN Pete Carroll 52 4
4 2014-02-02 48 MetLife Stadium East Rutherford New Jersey 82529 Seattle Seahawks 43 Russell Wilson NaN Pete Carroll Denver Broncos 8 Peyton Manning NaN John Fox 51 35
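Before relying on difference_pts, it may be worth confirming that it agrees with the score columns. The sketch below assumes difference_pts is winning_pts minus losing_pts and combined_pts is their sum, and checks that assumption on the five rows shown in Out[5]. The same two comparisons could be run on the full super_bowls DataFrame.

```python
import pandas as pd

# Stand-in data: score columns from the five rows shown in Out[5]
sample = pd.DataFrame({
    "super_bowl": [52, 51, 50, 49, 48],
    "winning_pts": [41, 34, 24, 28, 43],
    "losing_pts": [33, 28, 10, 24, 8],
    "combined_pts": [74, 62, 34, 52, 51],
    "difference_pts": [8, 6, 14, 4, 35],
})

# Row-wise consistency checks (assumed definitions of the derived columns)
diff_ok = (sample["winning_pts"] - sample["losing_pts"]).eq(sample["difference_pts"])
sum_ok = (sample["winning_pts"] + sample["losing_pts"]).eq(sample["combined_pts"])

# Super Bowl numbers whose derived columns disagree with the scores
inconsistent = sample.loc[~(diff_ok & sum_ok), "super_bowl"].tolist()
print(inconsistent)
```

An empty list means every row is consistent, which is the case for the five rows used here.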
In [6]:
# Filter the data for point difference >40
difference = len(super_bowls[super_bowls["difference_pts"]>40])

print(f"Number of matches that finished with a point difference over 40: {difference}")
Number of matches that finished with a point difference over 40: 1
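The same count can be obtained by summing the boolean mask directly, which avoids building a filtered DataFrame. The sketch below wraps that in a hypothetical helper, count_blowouts, with the threshold as a parameter. The stand-in data holds only the five rows shown in Out[5], so its counts differ from the result on the full file.

```python
import pandas as pd


def count_blowouts(super_bowls: pd.DataFrame, threshold: int = 40) -> int:
    """Hypothetical helper: number of games whose point difference
    is strictly greater than `threshold`."""
    return int((super_bowls["difference_pts"] > threshold).sum())


# Stand-in data: only the five rows shown in Out[5], not the full super_bowls.csv
sample = pd.DataFrame({
    "super_bowl": [52, 51, 50, 49, 48],
    "difference_pts": [8, 6, 14, 4, 35],
})

over_40 = count_blowouts(sample)                # none of these five games
over_30 = count_blowouts(sample, threshold=30)  # only Super Bowl 48 (35 points)
print(over_40, over_30)
```

Making the threshold a parameter makes it easy to see how sensitive the answer is to the cutoff chosen in the question.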
In [7]:
# We can also plot a histogram of point differences to visualize the result
plt.hist(super_bowls.difference_pts)
plt.xlabel('Point Difference')
plt.ylabel('Number of Super Bowls')
plt.show()
[Figure: histogram of point differences, with 'Point Difference' on the x-axis and 'Number of Super Bowls' on the y-axis]

Who performed the most songs in Super Bowl halftime shows?

In [8]:
# Load the CSV data 
halftime_musicians = pd.read_csv("halftime_musicians.csv")

# Display the data
halftime_musicians.head()
Out[8]:
super_bowl musician num_songs
0 52 Justin Timberlake 11.0
1 52 University of Minnesota Marching Band 1.0
2 51 Lady Gaga 7.0
3 50 Coldplay 6.0
4 50 Beyoncé 3.0
In [9]:
# Sum the number of halftime show songs for each musician.
# Only num_songs is selected, because a sum of super_bowl numbers has no meaning.
halftime_appearances = halftime_musicians.groupby('musician')[['num_songs']].sum()
halftime_appearances = halftime_appearances.sort_values('num_songs', ascending=False)

halftime_appearances
Out[9]:
num_songs
musician
Justin Timberlake 12.0
Beyoncé 10.0
Diana Ross 10.0
Grambling State University Tiger Marching Band 9.0
Bruno Mars 9.0
... ...
Doc Severinsen 0.0
Southeast Missouri State Marching Band 0.0
Ella Fitzgerald 0.0
San Diego State University Marching Aztecs 0.0
Judy Mallett 0.0

111 rows × 1 columns

In [10]:
most_songs = "Justin Timberlake"
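Rather than typing the name by hand, the answer can be derived from the data. The sketch below is one way to do that, shown on a stand-in DataFrame built from the five rows in Out[8]. The variable songs_per_musician is introduced here for illustration only.

```python
import pandas as pd

# Stand-in data: only the five rows shown in Out[8], not the full halftime_musicians.csv
sample = pd.DataFrame({
    "super_bowl": [52, 52, 51, 50, 50],
    "musician": [
        "Justin Timberlake",
        "University of Minnesota Marching Band",
        "Lady Gaga",
        "Coldplay",
        "Beyoncé",
    ],
    "num_songs": [11.0, 1.0, 7.0, 6.0, 3.0],
})

# Total songs per musician, then the index label of the largest total
songs_per_musician = sample.groupby("musician")["num_songs"].sum()
most_songs = songs_per_musician.idxmax()
print(most_songs)
```

Note that idxmax returns only the first label at the maximum, so a tie for first place would need separate handling. On the full dataset, Justin Timberlake's total is 12.0 because he appears in more than one halftime show.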

These datasets can be explored further to answer many more questions. Thank you.